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Abstract 

Background: Wiener and Granger have introduced an intuitive concept of causality (Granger causality) between 
two variables which is based on the idea that an effect never occurs before its cause. Later, Geweke has 
generalized this concept to a multivariate Granger causality, i.e., n variables Granger-cause another variable. 
Although Granger causality is not "effective causality" in the Aristothelic sense, this concept is useful to infer 
directionality and information flow in observational data. Granger causality is usually identified by using VAR 
(Vector Autoregressive) models due to their simplicity. In the last few years, several VAR-based models were 
presented in order to model gene regulatory networks. Here, we generalize the multivariate Granger causality 
concept in order to identify Granger causalities between sets of gene expressions, i.e., whether a set of n genes 
Granger-causes another set of m genes, aiming at identifying and quantifying the flow of information between 
gene networks (or pathways). 

Results: The concept of Granger causality for sets of variables is presented. Moreover, a method for its 
identification with a bootstrap test is proposed. This method is applied in simulated and also in actual biological 
gene expression data in order to model regulatory networks. 
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Conclusions: This concept may be useful to understand the complete information flow from one network or 
pathway to the other, mainly in regulatory networks. Linking this concept to graph theory, sink and source can 
be generalized to node sets. Moreover, hub and centrality for sets of genes can be defined based on total 
information flow. Another application is in annotation, when the functionality of a set of genes is unknown, but 
this set is Granger caused by another set of genes which is well studied. Therefore, this information may be 
useful to infer or construct some hypothesis about the unknown set of genes. 



Background 

Norbert Wiener [1] conceived the notion that, if the prediction of one time series could be statistically 
improved by incorporating the knowledge of past values of a second one, then the latter is said to have a 
"causal" influence on the former. Clive Granger [2] has formalized the prediction idea in the context of 
VAR (Vector Autoregressive) models. More specifically, if the variance of the autoregressive prediction 
error of one time series at the present time is statistically reduced by inclusion of past measurements from 
another time series, then the latter is said to have a Granger causal influence on the former. From this 
definition it is clear that the flow of time plays a crucial role in allowing inferences to be made about 
directional causal influences from time series data. Due to its simplicity and intuitive idea that an effect 
never occurs before its cause, it has been widely used in several areas such as econometrics [3-5] , 
neuroscience [6-8] and more recently, in bioinformatics [9-15]. The idea is that, if a variable x affects a 
variable y, past values of the former should be useful in generating predictions for the latter variable. 

Later, Geweke [16] has generalized the bivariate Granger causality to a multivariate fashion in order to 
identify conditional Granger causality. To illustrate the bivariate and multivariate Granger causalities, 
suppose three processes, where one drives the other two with differential time delays. A pairwise analysis 
would indicate a causal influence from the process that receives an early input to the process that receives 
a late input. Conditional Granger causality may be useful to disambiguate these situations, since it has the 
ability to resolve whether the interaction between two time series is direct or is mediated by another 
recorded time series and whether the causal influence is simply due to differential time delays in their 
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respective driving inputs. 



Nagarajan and Upreti (2008) [14] and Nagarajan (2009) [15] investigated the use of bivariate VAR for 
acyclic approximations of networks composed of two genes by exploring the parameters defined as 
transcriptional noise variance, autoregulatory feedback, and transcriptional coupling strength, which may 
influence some measures of Granger Causality. These authors have shown that under some conditions, 
uni-directional/acyclic approximations may provide meaningful and useful results. 

Frequently, conditional Granger causality is identified by using VAR models due to its simplicity [17]. In 
the last few years, several extensions of the standard VAR model namely, Dynamic VAR (DVAR) [10], 
Sparse VAR (SVAR) [11,18-20] and Nonlinear VAR (NVAR) [12,21-24] have been proposed to model gene 
regulatory networks. DVAR were developed in order to identify time-varying Granger causalities, i.e., to 
identify structural changes along time (cell cycle) in regulatory networks. However, DVAR identifies only 
linear Granger causalities. In order to overcome this limitation, a model capable to identify nonlinear 
Granger causalities, namely, NVAR, was introduced. Another problem in bioinformatics is the high 
dimensional characteristic of gene expression data, where the number of parameters (genes) is higher than 
the number of observations (microarrays). Therefore, constructing regulatory networks with statistical tests 
are difficult. In order to identify linear Granger causalities in this context, SVAR was proposed [11, 18, 19]. 

These models are useful to identify Granger causalities between genes (gene expression signals), however, 
sometimes, it is necessary to identify and quantify Granger causality between sets of genes. Thus, for 
example, one may need to quantify the sum of Granger causalities from one gene cluster to the other gene 
cluster in order to study the total amount of information flow between different regulatory pathways and to 
identify which pathway is crucial. Another application may be the identification of Granger causality 
between a set of genes whose biological function is yet unknown and another set of genes which are better 
studied and for which more characteristics are known. This latter case may help to determine and infer the 
possible biological process affected by the set of genes whose functionalities are not yet known. 

The theoretical generalization of Granger causality between sets of variables has not been sufficiently 
explored. Here, we propose a definition of Granger causality between sets of time series, where m time 
series Granger-cause n other time series, thus generalizing the two previous definitions (bivariate and 
multivariate). Furthermore, a method for identification of Granger causality and a statistical test based on 
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nonparametric bootstrap are also proposed. The results are illustrated by simulations and, also, by 
application of this new concept to actual biological gene expression data. 

Results and Discussion 
Simulation 

In order to study the properties of our method to identify Granger causalities between sets of genes, we 
have designed a simulation study which contains 14 artificial genes in a time series of length T = 100 
(simulation 1). 

Figure 1A illustrates an example of time series obtained by our simulation study. Figure IB illustrates the 
structure of the artificial regulatory network and the interaction between sets (I, II and III). Figure 1C 
represents the true model, i.e., if our approach is working correctly, it must identify the structure 
illustrated in Figure 1C. 

Notice in Table 1 that where there is no edge, the number of false-positives is actually controlled to 5% 
(<~ 500). In bold, the edges which actually are present in the network are illustrated. Notice that, as 
expected, where there is an edge, the number of identifications is higher than where there is no edge. 
Moreover, verify that the number of false-positives is actually controlled to 5%, i.e., the bootstrap 
procedure is working adequately. 

Interpreting the loops (auto-regressions) from I to I and from II to II, they are representing the total 
information flow contained in networks I and II, respectively. Another interpretation for the loops may be 
the density of the network, since this density may represent the sum of the information flow. On the other 
hand, no Granger causality is found in set III, therefore, no loop is present. It is necessary to point out 
that these loops are exactly the CCA(Y l t ,Yl_ 1 \Y t \{Yl_ 1 }) (CCA: Canonical Correlation Analysis), i.e., 
the total connectivity inside set Y l , where i=l, II or III. Notice that CCA is always applied between the 
past of one subset and the present of another subset of the time series. 

Figure 2A illustrates the structure of an artificial network and the interaction between sets of time series 
(Figure 2B) described in simulation 2. In this simulation 2, both approaches, based on CCA and blockwise 
VAR proposed by [25] were applied in order to evaluate the power of the proposed method. The blockwise 
VAR model was applied using all the variables, i.e., a large network was constructed using VAR. The test 
between sets for VAR were performed by testing the estimated coefficients between sets using the Wald's 
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test (whether all of them are equal to zero simultaneously). The Wald's test is equivalent to test the ratio 
of the prediction's error variance proposed by [16]. Notice, in Table 2 that, again, the false-positives are 
actually controlled to 5%. In bold, are the number of times Granger causality was identified in a total of 
10,000 simulations. Analyzing the number of Granger causalitites identified by both methods, the method 
based on CCA is more powerful than the standard VAR. This is due to the CCA characteristic which 
maximizes the correlation by calculating the optimum linear combination of both sets while the blockwise 
VAR sets equal weights to each time series. In a biological sense, time series W 3 (Figure 2) may be 
interpreted as a transcription factor which Granger causes several other genes. The method based on CCA 
is also clearly more powerful in the case of auto loops (from I to I, from II to II and from III to III). 

Biological data 

In order to illustrate a practical application of the proposed model, an actual biological regulatory network 
was modeled. The following 15 genes, namely, RECK, SRC, C-MYC, TEMP 2, TP53, p21, GADD45A, 
FGFRL1, FGF2, FGFR2, FGFR1, NEMO, IKKA, 1KB and NFKB, were selected, the same which were 
used in our previous works [10-12]. The proposed approach was applied to the analysis of HeLa cell cycle 
gene expression data collected by [26]. HeLa is an immortal cell line, derived from a human epithelial 
cervical carcinoma. Its cell cycle is of approximately 16 hours. The gene expression data used in this study 
contains three complete cell cycles, i.e., 48 time points distributed at intervals of one hour. The HeLa gene 
expression data is freely available at: [27]. 

The sets of genes were composed as follows: (I) RECK, SRC, C-MYC and TIMP2; (II) TP53, p21 and 
GADD45A; (III) FGFRL1, FGF2, FGFR2 and FGFR1 and (IV) NEMO, IKKA, 1KB and NFKB. The 
networks illustrated inside each cluster (Figure 3) were obtained by applying the standard VAR model. 
The edge links between the clusters were obtained by the method proposed here. Unfortunately, the 
auto-loops in sets III and IV were not identified probably due to the power of the test. Increasing time 
series length may increase the power of the test and, consequently, identifying the auto-loops. Notice that 
in the simulations, the proposed approach was able to correctly identify the Granger causalities. 

The tumor progression processes, as other pathological and physiological events, are regulated by responses 
to changes in the external environment, and by following coordinated modulation of gene expression in 
order to appropriately adapt to these changes. Several molecules are responsible for controlling cell 
behavior, among which, growth and transcriptional factors display prominent functions in the context of 
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tumorigenesis. The networks of FGF, c-myc, p53 and NFKB, used in this work to demonstrate a biological 
application of the proposed model, are able to control almost all processes involved in development and 
tumor progression. Proliferation and differentiation, cell survival and death, cell division and cycle arrest, 
cell migration, adhesion and invasion are just some examples of cellular phenotypes modulated by these 
molecules [28]. Therefore, the choice of these sets of genes is appropriate for validating the biological 
application of the proposed model. Thus, prediction of the interaction between these important gene 
clusters could elucidate the complex relationship among them and, consequently, allow construction of 
hypotheses about how the interaction between these pathways orchestrate the regulation of processes that 
are often mutually exclusive. 

Several members of the fibroblast growth factors (FGFs) family and their respective specific cell surface 
receptors (FGFRs) play important roles in a variety of normal and pathological processes, including tumor 
transformation. In the tumor context, FGFs are well known for promoting tumor angiogenesis and 
invasiveness, and for acting as mitogenic agents and cell death inhibitors in many cell types [29]. However, 
recent reports have shown that some of these molecules can also promote tumor suppression, depending on 
the cell type and the number and strength of stimuli. FGF2, a gene belonging to set III, is also implied in 
restoring tumor defense mechanisms in malignant cells [30] . 

By controlling a variety of downstream target genes, the p53 transcription factor has many 
tumor-suppressor activities. The p53 protein is modified in more than 50% of human tumors. Thus, many 
processes are subverted due to p53-alteration, such as, for instance: cell cycle arrest, inhibition of 
metastasis and angiogenesis, DNA repair and cell death induction [31]. On the other hand, aberrant c-myc 
activity is associated with genomic instability and tumor progression. This transcriptional factor is 
encoded by a proto-oncogene, whose product modulates transcriptional processes during normal cell 
growth and proliferation [32]. However, in the tumoral context, deregulated c-myc activity is a classic 
molecular marker of poor prognostics in several types of cancers. Moreover, the c-Myc protein is able to 
regulate other important events involved in tumor transformation, including hormone dependence, 
invasiveness and metastatic potential [33]. In all of these processes, c-Myc positively associates with tumor 
progression. Like c-Myc, the NFKB family of transcription factors has been shown to be constitutively 
activated in various human malignancies, namely: leukemias, lymphomas, and a number of solid tumors. 
As the other three pathways analyzed in this work, NFKB is able to control a variety of events by 
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regulating the expression of genes involved in angiogenesis, metastasis, cell growth, proliferation and 
apoptosis. Therefore, NFKB has been implicated in transcriptional upregulation of several growth factors, 
cytokines, adhesion molecules, antiapoptotic proteins and oncogene products [34]. 

Here, we were able to identify the Granger causality between the previously summoned networks. We 
predicted that the set of genes (I) Granger-causes (II), and that (IV) Granger-causes (II) and (III). 
Therefore, by applying our model, we predicted that the c-Myc network is able to induct p53 related genes, 
and that the NFKB pathway positively regulates the FGF and p53 sets of genes (Figure 3). By applying 
the standard VAR model and jointly testing the coefficients [16], it was only possible to identify Granger 
causality from clusters IV to II, confirming the simulation's results (simulation II) that the power of the 
proposed method in actual biological data is superior to the traditional one. 

Previous reports have demonstrated that p53- induced Gl/S cell-cycle arrest is attributed mainly to its 
ability to the transcriptionally upregulatc p21 and repress c-myc [35]. Induction of p21 by p53 is in 
accordance with the predicted connectivity previously reported [10-12]. However, the well established p53, 
acting as a c-myc inhibitor contradicts the identified Granger causality. On the other hand, there are other 
described mechanisms demonstrating that c-myc is able to indirectly induce p53 through Arf regulation. 
Thus, pl4^^ (the A rf product in humans) inhibits Mdm2 and, consequently, stabilizes p53 [36]. 
Therefore, this last mechanism demonstrates that the anticipated relationship of p53 induction by c-myc, 
predicted by our model, is biologically relevant. Several reports linking NFKB and p53 networks are 
available. As mentioned for the previous sets of genes, these transcriptional factors can reciprocally 
regulate each other. NFKB has been reported to induce Mdm2, consequently inhibiting p53 [37]. In 
contrast, NFKB has also been described as an inducer of the p53 promoter [38]. Thus, the positive 
modulation of the p53 pathway by NFKB set of genes, determined by our model, corroborates previously 
described mechanisms. Moreover, these results suggest the occurrence of a synergistic mechanism to induce 
p53, in view of the fact that both c-myc and NFKB networks are involved in up-regulation of this 
transcriptional factor. 

By examining the sign and the magnitude of the canonical weight assigned to each variable, i.e., the 
eigenvectors a and b (see Methods section), it is possible to verify that variables with larger weights 
contribute more to the variables. Moreover, variables whose weights have opposite signs exhibit an inverse 
relationship with each other, and variables with weights of the same sign exhibit a direct relationship. 
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Conclusions 

Comparison between this method here proposed and other approaches such as Bayesian networks, is 
difficult, since, by definition, they analyze/identify different kinds of causality. Moreover, although our 
previous reports (DVAR, SVAR and NVAR) identify time-varying or nonlinear Granger causalities, they 
are limited to the usual "between time series (genes)" . The method we propose here generalizes the 
standard Granger causality to relationships between sets of time series, which, to the best of our 
knowledge, is the first to address this issue. 

Ordinarily, when the number of genes is relatively large, interpreting the pairwise Granger causalities is 
hopeless. Moreover, linear combinations of variables are often interesting and useful for predictive or 
comparative purposes. In addition, pairwise Granger causality does not distinguish direct from indirect 
causalities because it is not conditioned by a third set of time series. The purpose of the definition we 
present for Granger causality is to summarize the associations between the Y£ and F/ sets in terms of few 
carefully chosen covariances. This summarized measure illustrates the strength of Granger causality 
between sets Y t l and V/ . This may be useful to understand the complete information flow from one 
network or pathway to the other, mainly in regulatory networks, where the number of genes is high and 
interpreting gene-gene Granger causalities may be cumbersome. Linking this concept to graph theory, 
some graph theoretical properties, such as sink and source can be generalized to node sets. Moreover, hub 
and centrality for sets of genes can be defined based on its total information flow. Another application is in 
annotation, i.e., when the functionality of a set of genes is unknown, but this set is Granger caused by 
another set of genes which is well studied. Therefore, this information may be useful to infer or construct 
some hypothesis about the unknown set of genes. 

Here, we have assumed that the sets of time series are given. However, in practice, it is necessary to 
identify which time series belongs to each set. To this purpose, biological a priori information may be used 
to define the different gene sets. Thus, for example, if it is known that one set of genes respond to a 
specific drug and there is another set of genes which respond to the same drug, but no literature 
information is available about the existence of a pathway between these two sets, one may use our 
approach in order to obtain some clues about the existence (or not) of this pathway. Another application 
may be the search for crosstalks between under-studied pathways. On the other hand, an objective and 
systematic method based on clustering may be used. The latter is the object of our future studies. 
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Methods 
Granger causality 

In order to formalize the concept of Granger causality between sets of time series, suppose that 3 t is a set 
containing all relevant information available up to and including time-point t. Let Y t , and F/ be sets of 
time series containing k, m and n time series, respectively, where YJ and Y t 3 are disjoint subsets of Y t , i.e., 
each time series only belongs to one set, and thus, k > m + n. Let Y t (h\$st) be the optimal (i.e., the one 
which produces the minimum mean squared error (MSE) prediction) h-step predictor of the set of m time 
series Y\ from the time point t, based on the information in 3 t . The forecast MSE of the linear 
combination of will be denoted by fiy(ft|9f t ). The set of n time series Y° t is said to Granger-cause the 
set of m time series Y\ if 

Q Y (h\%) < n Y (h\%\{Y j s \s < t})for at least one h=l,2,..., (1) 

where 3 t \{Y^|s < t} is the set containing all relevant information except for the information in the past 
and present of Yj. In other words, if Y\ can be predicted more accurately when the information in Y^ is 
taken into account, then is said to be Granger-causal for Y\. 

Applying the idea of Granger causality to regulatory networks, a set of gene expression time series Y^ 
Granger-causes another set of gene expression time series YJ, if linear combinations of Y J t provide 
statistically more significant information about future values of linear combinations of Y\ than considering 
only the past values of Y\ . Thus, past gene expression values of Yf allow the prediction of more accurate 
gene expression values of Y\ . Notice that since this relationship is not reciprocal, Granger causality may 
be interpreted as information flow [39]. Moreover, it is important to highlight that Granger causality is not 
actually inferring "effective" causality in the Aristothelic sense, i.e., interaction of gene products (or 
protein-protein interactions), since the former is based solely on prediction and quantitative criteria, as 
described before, however, this concept may be useful to suggest some insights on molecular interactions 
which may then be experimentally tested. 

Notice that this definition generalizes not only the original bivariate Granger causality (where Y\ and Y^ 
are 1-dimensional) but, also, the multivariate Granger causality (where Y^ is m-dimensional and Y\ is 
1-dimensional), i.e., the bivariate and multivariate cases are special cases of the Granger causality for sets 
of time series. 
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Identification 

Due to the simplicity of notation and concepts, only the identification of Granger causality in an 
Autoregressive (AR) process of order one will be presented, generalization for other orders being 
straightforward. 

Hotelling [40,41] was the first to tackle the problem of identifying and measuring relationships between two 
sets of variables, introducing the Canonical Correlation Analysis (CCA) [42-45]. Canonical correlation 
analysis can be understood as the bivariate correlation of two synthetic variables which are the linear 
combinations of the two sets of observed variables. The original variables of each set are linearly combined 
to produce pairs of synthetic variables which have maximal correlation. 

The present problem consists in verifying whether Y J t Granger-causes Y\ . In the linear case, Granger 
causality for sets of variables may be identified based on the idea initially proposed by Hotelling. Notice 
that linear combinations provide simple summarized measures of a set of variables. 

Therefore, for the linear case, let Y\ (m-dimcnsional) and Y J t (n-dimensional) be two disjoint subsets of 
Y t , where Y t is a fc-dimensional set of stationary time series (k > m + n), in other words, each time series 
only belongs to one cluster. Then, Y^ is Granger non-causal for Y\ if the following condition holds: 



where p is the largest correlation calculated by Canonical Correlation Analysis. Notice that CCA is applied 
to the time lags and not to the instantaneous time step. Furthermore, correlation may be interpreted as 
the square root of R 2 of a linear regression model, where R 2 represents the variance of the response 
variable which may be explained by the regressors, i.e., when partial CCA is applied between the past 
values of one subset and present values of another subset, it is identifying Granger causality between 
groups of time series. 

In order to calculate p, set: 



CCA(Yi,Y>_ 1 \Y t \{Yl_ 1 }) = p = 



(2) 




(3) 



and 
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for some pair of coefficient vectors a and b. Then, we obtain 



(4) 



Var(u) = a'Ccw(Yj)a = a'£ Y ; Y ja, (5) 
Var(v) = b'CouCY^Jb = b'S Y , Y , b, (6) 



■ t-i * t-i 



Cow(u, v) = a!Cov{Y\, Y^Jb = a'S Y , Y; ^b. (7) 
We shall seek coefficient vectors a and b such that 

a'SY^Y^ ^ 

Corr(u, v) = ^ (8) 

V a ' E Y{Y«ayb'E Y i_ iY i_ i b 

is maximized. 

However, notice that the calculations performed above do not identify conditional Granger causalities, i.e., 
Granger causalities partialized by a third set of time series. Therefore, it is necessary to develop a partial 
canonical correlation analysis. 

The crucial point in performing partial canonical correlation analysis [46] is to derive from 
variance-covariance matrices, linear coefficients for combining original variables into canonical variables. It 
turns out that the residualized variance-covariance matrices of Y\ and Y^_ 1 after partialing out the effect 
of vector X = Y t _i\{Yj_ 1 } from both Y\ and Y 3 t _ 1 , provide the solutions. More specifically, suppose 
that, when combined, the three vectors of variables, YJ, Y^ and X have the following partitioned 
variance-covariance matrix: 



yin_i x 



yi-^l yj_iy;_i y?_i x 



(9) 



where the (r, s)th entry of Suv is given by (T — 1) _1 5Zfe=i(Ufer — Ur)(Vfc s — V s ) (T: time series' length). 
The conditional variance-covariance matrix of y\ and y^_ 1 , partialing out the effect of x is given as [47-49]: 
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where 

E y^|x = S yjyj - S yjx S xi S xyj (H) 

^-iy{l« = ^yJ-iy? ~ ^i^^x^xy; ( 13 ) 

Similar to regular canonical correlation analysis [48], the eigenvalues (d = 1, . . . , min(m, n)) from the 
following two matrices A and B, will be the squared partial canonical correlation coefficients for the dth 
canonical functions and the eigenvectors and associated with the eigenvalue Xd will be the linear 
coefficient vectors which combine the original variables into synthetic canonical variables: 



A - ^yjSAjy^lx^/ ^Jx^-itfAfrJI* (15) 



Let Ai > A 2 > . . . > A • / \ be the ordered eigenvalues of matrices A and B, therefore 

J. 11 111 I / 1 1 • / v } 

CCA{y\,yi_ x \y t \{yi_ x }) = p = V^T- 

Again, as in regular canonical correlation analysis, the two matrices A and B, have the same eigenvalues 

but with different eigenvectors, i.e, each eigenvector is proportional to S -1 ^ 2 ■ X j 4 , E" 1 ^ a<j. 

y t _ 1 y t _ 1 |x y t -iytl x y t y t l x 

Statistical test 

The hypothesis to be tested in order to verify the existence of Granger causality between sets of variables is 
as follows: 

H : CCA{Y\, Yf_!|X) =p = 0(Granger noncausality) 
Hi : CCA(Y\, YJ_!|X) = p ^ 0(Granger causality) 
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The bootstrap for our method is based on the block bootstrap [50] . It consists of splitting the data into 
blocks of observations (with overlapping) and sampling the blocks randomly with replacement. To describe 
this bootstrap more precisely, let the data consist of observations YJ and Y 3 t (t = 1, . . . , T). 

With overlapping blocks of length I, block 1 is observations Y/j : h = 1, . . . , I, block 2 is observations 
Y^+i : h = 1, . . . ,1, block 3 is observations Y^ +2 : h = 1, . . . , I and so forth. The bootstrap sample Y" is 
obtained by sampling blocks randomly with replacement from Y\ and laying them end-to-end in the order 
sampled. The bootstrap sample Y J t * is obtained in an analogous way. This block re-sampling is carried out 
in order to capture the dependence structure of neighbourhood obsevations, i.e., autocorrelation. Moreover, 
the resampling of all time series of Y t l is carried out together, in order to capture contemporaneous 
correlations between time series. The same is performed for the time series of Y t J . However, YJ and Y^ are 
re-sampled independently, in order to break the relationship between the response and predictor variables. 

After constructing the bootstrap samples Y" and Y^*, calculate CCAfY", Y^l-JX*) — p*, where 

X* = Y(_ 1 \{Yjl 1 }. Repeat these steps until obtaining the desired number of bootstraps. Then, use the 

empirical distribution of p* to test whether p = (the bootstrap scheme is illustrated in Figure 4). 

Regardless of the block bootstrap that is used, the block length I must increase with increasing time series 
length T to render bootstrap estimators of moments and distribution functions consistent [51-53]. 
Similarly, the block length must increase with increasing sample size to enable the block bootstrap to 
achieve asymptotically correct coverage probabilities for confidence intervals and rejection probabilities for 
hypothesis tests. For the special case of an autoregressive process of order one, [51] showed that the block 
length I that minimizes the asymptotic mean-square error of the variance estimator increases at the rate of 
I (xT3 . Since time series gene expression data are generally short, it is unfeasible to fit an AR model of 
higher orders. However, if a longer time series data becomes available, one may use the algorithm proposed 
by [54] in order to select the block length for the bootstrap procedure. 

Implementation 

The method to identify Granger causality between sets of time series was implemented in R [55] . The R 
code may be accessed in the Supplementary Material. 
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Computational simulation 

Before analyzing actual gene expression data, we conducted Monte Carlo simulations to examine whether 
our approach is able to identify Granger causality between sets of time series and effectively control the 
rate of false positives. 



The artificial network is set as: 



simulation 1 < 



%2,t = P X Zl,t-1 + £ 2.t 

Z\ t ={$xZ\ t _ x -pxZl t _ x +e 3 + 
Za + = {3 x Z2 t_i ~\~ £4 



7 l 

Z 5,t 

7 ll 

7II 

zl l 



e 5 .t 

pxZi-pxZl^ + erj 

P x ^10.4-1 + £ s,t 

pxZK+pxZllt 

-/3xZ™_i+£io, t 



8,t ^10,4-1 

= P x Z\^_ x - 

„.i = P x ^9,t-i 
III « ^ vll 



-1 + e 9 : t 



Vi - /? X Z g t-1 + £ 11)t 
7llf _ ,- 

ylll _ ,- 
Z 13.t - £ 13,i 

_ ,- 

Z 14,t — £ 14,t 

for t = 1, . . . , T, where the noises e^ t (i = 1, . . . , 14) are normally distributed with mean of zero and 
variance of one and I, II and III represent the sets to which each time series Zj ;t belong. The time series 
length is set to 100, i.e., T = 100, and (3 is set to 0.4. 

In order to evaluate the power in identifying Granger causalities between sets of time scries, a comparison 
between the proposed method based on CCA and the standard VAR model was performed. The following 
artificial network is used for the comparison: 
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WlL = 7 x Wl ,_ x - 7 x + 7 x W™ 



12,t-l 



+ eio,i 



<t - 7 x W^t-i - 7 x W™t-i + eu,t 
Wll t - 7 x WH 1 ^ + e 12 , t 
W$ t ='yxW% t _ 1 +e 13 , t 

for t = 1, . . . , T, where the noises ej ;t (i = 1, . . . , 13) arc normally distributed with mean of zero and 

variance of one and I, II and III represent the sets which each time series W, ;t belongs to. The time series 
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length is set to 100, i.e., T — 100, and 7 is set to 0.2. 

For each simulation, 1 and 2, 10,000 networks were generated. For each artificial network, the proposed 
approach was applied to identify Granger causalities between the sets of variables. The threshold to 
discriminate the existence of an edge was p < 0.05, i.e., the false positives rate was controlled to 5% and 
the number of bootstraps was set to 1,000. For the standard VAR model, the Wald's test was used in order 
to identify and test Granger causality between sets. In other words, suppose three sets I, II and III as in 
simulation 2. The coefficients of the VAR model related to the time series of set I to II were simultaneously 
tested in order to verify whether they are equal to zero or not. The same was performed with the 
coefficients I I, I -> III, II -> I, II -> II, III I, III II and III III. 
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Figure 1 - Simulation 1 

Simulation 1. (A) Time series data for each simulated variable; (B) Time series Zi (i = 1, . . . , 14) represent 
variables. I, II and III are the sets of time series composed by Zi. Arrows represent the Granger 
causalities; (C) Granger causality between sets of time series. 

Figure 2 - Simulation 2 

Simulation 2. (A) The time series Wi (i = 1, . . . , 13) represent the variables. I, II and III are the sets of 
time series composed by Wi. Arrows represent the Granger causalities; (B) Granger causality between sets 
of time series. 

Figure 3 - Actual biological network 

Identification of Granger causality to a network composed by I: RECK, SRC, C-Myc, TIMP2; II: TP53, 
p21, GADD45A; III: FGFRL1, FGF2, FGFR2, FGFR1 and IV: NEMO, IKKA, 1KB and NFKB genes. 
Connectivities inside each set were identified by using the standard VAR model. Dashed arrows are 
significant connectivities with p < 0.1 and full arrows are significant connectivities with p < 0.05. 

Figure 4 - Illustrative diagram of the bootstrap. 

Time series 1, 2 and 3 belongs to the set Y£ while time series 4 and 5 belong to the Yj? set. Notice that the 
blocks may be overlapped. 

Tables 

Table 1 - Simulation 1. Number of Granger causalities obtained using our approach in 10,000 
simulations. The rate of false positives was controlled in 5%, i.e., it is expected to obtain <~ 500 
false-positives where there is no Granger causality. In bold, are the Granger causalities which actually 
exist. 



from/ to 
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II 


III 


I 


10,000 


8,543 


507 


II 


461 


9,979 


6,646 


III 


500 


8,015 


437 



Table 2 - Simulation 2. Number of Granger causalities obtained using our approach in 10,000 
simulations. In parenthesis are the number of Granger causalities obtained by standard VAR. The rate 
of false positives was controlled in 5%, i.e., it is expected to obtain ~ 500 false-positives where there 
is no Granger causality. In bold, are the Granger causalities which actually exist. 
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from/ to 
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II 


III 


I 


7,154 (1,079) 


10,000 (9,980) 


505 (589) 


II 


466 (547) 


10,000 (9,997) 


463 (522) 


III 


1,490 (1,015) 


10,000 (9,827) 


6,263 (2,107) 



Additional Files 
Additional file 1 — R code 

R code to identify Granger causality between gene sets. 
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